Menard County
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > Texas > Menard County (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Asia (0.05)
- Europe > Spain > Galicia > Madrid (0.04)
Gradient-free training of recurrent neural networks
Bolager, Erik Lien, Cukarska, Ana, Burak, Iryna, Monfared, Zahra, Dietrich, Felix
Recurrent neural networks are a successful neural architecture for many time-dependent problems, including time series analysis, forecasting, and modeling of dynamical systems. Training such networks with backpropagation through time is a notoriously difficult problem because their loss gradients tend to explode or vanish. In this contribution, we introduce a computational approach to construct all weights and biases of a recurrent neural network without using gradient-based methods. The approach is based on a combination of random feature networks and Koopman operator theory for dynamical systems. The hidden parameters of a single recurrent block are sampled at random, while the outer weights are constructed using extended dynamic mode decomposition. This approach alleviates all problems with backpropagation commonly related to recurrent networks. The connection to Koopman operator theory also allows us to start using results in this area to analyze recurrent neural networks. In computational experiments on time series, forecasting for chaotic dynamical systems, and control problems, as well as on weather data, we observe that the training time and forecasting accuracy of the recurrent neural networks we construct are improved when compared to commonly used gradient-based methods.
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Oceania > Australia > South Australia (0.04)
- North America > United States > Texas > Menard County (0.04)
- Europe > Austria (0.04)
- Instructional Material (0.92)
- Research Report > New Finding (0.46)
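For readers who want to see the recipe described in the abstract above in concrete terms, here is a minimal sketch, assuming a single recurrent block with randomly sampled hidden parameters and a linear readout obtained from a ridge-regularized least-squares solve in the spirit of extended dynamic mode decomposition; the function names, tanh nonlinearity, and ridge parameter are illustrative choices, not the authors' implementation.

```python
import numpy as np

def fit_random_feature_rnn(series, hidden_dim=256, ridge=1e-6, seed=0):
    """Sample hidden weights at random, then fit only the outer weights
    with a least-squares solve (no backpropagation through time)."""
    rng = np.random.default_rng(seed)
    d = series.shape[1]
    W_in = rng.normal(scale=1.0 / np.sqrt(d), size=(hidden_dim, d))
    W_rec = rng.normal(scale=1.0 / np.sqrt(hidden_dim), size=(hidden_dim, hidden_dim))
    b = rng.normal(size=hidden_dim)

    # Run the recurrent block once over the data to collect hidden states.
    h, H = np.zeros(hidden_dim), []
    for x in series[:-1]:
        h = np.tanh(W_in @ x + W_rec @ h + b)
        H.append(h)
    H = np.asarray(H)              # hidden features at times 0 .. T-2
    Y = series[1:]                 # one-step-ahead targets at times 1 .. T-1

    # Outer weights from a ridge-regularized least-squares (EDMD-style) solve.
    W_out = np.linalg.solve(H.T @ H + ridge * np.eye(hidden_dim), H.T @ Y)
    return W_in, W_rec, b, W_out
```

Forecasting then amounts to iterating the recurrent block and applying `W_out`, so the entire training cost is one pass over the data plus one linear solve.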
On a continuous time model of gradient descent dynamics and instability in deep learning
Rosca, Mihaela, Wu, Yan, Qin, Chongli, Dherin, Benoit
The recipe behind the success of deep learning has been the combination of neural networks and gradient-based optimization. Understanding the behavior of gradient descent, however, and particularly its instability, has lagged behind its empirical success. To add to the theoretical tools available to study gradient descent, we propose the principal flow (PF), a continuous time flow that approximates gradient descent dynamics. To our knowledge, the PF is the only continuous flow that captures the divergent and oscillatory behaviors of gradient descent, including escaping local minima and saddle points. Through its dependence on the eigendecomposition of the Hessian, the PF sheds light on the recently observed edge of stability phenomena in deep learning. Using our new understanding of instability, we propose a learning rate adaptation method which enables us to control the trade-off between training stability and test set evaluation performance.
- Asia > Middle East > Jordan (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- North America > United States > Texas > Menard County (0.04)
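The edge-of-stability behavior mentioned above is governed by the largest Hessian eigenvalue (the sharpness). The following is a hedged sketch of a sharpness-aware step-size cap based on the classical $2/\lambda_{\max}$ divergence threshold for quadratics; it illustrates the quantity involved, not the paper's principal-flow-based adaptation method, and all names are assumptions.

```python
import torch

def sharpness(loss_fn, params, iters=20):
    """Estimate the largest Hessian eigenvalue by power iteration
    on Hessian-vector products."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat_grad)
    v = v / v.norm()
    eig = torch.tensor(0.0)
    for _ in range(iters):
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([g.reshape(-1) for g in hv])
        eig = v @ hv                      # Rayleigh quotient of the current iterate
        v = hv / (hv.norm() + 1e-12)
    return eig.item()

def stable_step_size(lmax, safety=0.9):
    # Gradient descent on a quadratic diverges along the top eigendirection
    # once the step size exceeds 2 / lmax; stay a safety factor below it.
    return safety * 2.0 / max(lmax, 1e-12)
```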
Gaussian random field approximation via Stein's method with applications to wide random neural networks
Balasubramanian, Krishnakumar, Goldstein, Larry, Ross, Nathan, Salim, Adil
We derive upper bounds on the Wasserstein distance ($W_1$), with respect to the $\sup$-norm, between any continuous $\mathbb{R}^d$-valued random field indexed by the $n$-sphere and the Gaussian, based on Stein's method. We develop a novel Gaussian smoothing technique that allows us to transfer a bound in a smoother metric to the $W_1$ distance. The smoothing is based on covariance functions constructed using powers of Laplacian operators, designed so that the associated Gaussian process has a tractable Cameron-Martin or Reproducing Kernel Hilbert Space. This feature enables us to move beyond the one-dimensional interval-based index sets that were previously considered in the literature. Specializing our general result, we obtain the first bounds on the Gaussian random field approximation of wide random neural networks of any depth with Lipschitz activation functions, at the random field level. Our bounds are explicitly expressed in terms of the widths of the network and the moments of the random weights. We also obtain tighter bounds when the activation function has three bounded derivatives.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Texas > Menard County (0.04)
- North America > United States > California > Yolo County > Davis (0.04)
- Asia > Singapore (0.04)
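For orientation, the metric in the abstract above can be written as a coupling infimum, $W_1(F, G) = \inf_{\pi} \mathbb{E}_{\pi}\big[\sup_{x \in \mathbb{S}^n} \|F(x) - G(x)\|\big]$, where the infimum runs over couplings $\pi$ of the laws of the two random fields; the precise variant used in the paper may differ in technical details.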
Out-of-sample error estimate for robust M-estimators with convex penalty
A generic out-of-sample error estimate is proposed for robust $M$-estimators regularized with a convex penalty in high-dimensional linear regression where $(X,y)$ is observed and $p,n$ are of the same order. If $\psi$ is the derivative of the robust data-fitting loss $\rho$, the estimate depends on the observed data only through the quantities $\hat\psi = \psi(y-X\hat\beta)$, $X^\top \hat\psi$ and the derivatives $(\partial/\partial y) \hat\psi$ and $(\partial/\partial y) X\hat\beta$ for fixed $X$. The out-of-sample error estimate enjoys a relative error of order $n^{-1/2}$ in a linear model with Gaussian covariates and independent noise, either non-asymptotically when $p/n\le \gamma$ or asymptotically in the high-dimensional asymptotic regime $p/n\to\gamma'\in(0,\infty)$. General differentiable loss functions $\rho$ are allowed provided that $\psi=\rho'$ is 1-Lipschitz. The validity of the out-of-sample error estimate holds either under a strong convexity assumption, or for the $\ell_1$-penalized Huber M-estimator if the number of corrupted observations and sparsity of the true $\beta$ are bounded from above by $s_*n$ for some small enough constant $s_*\in(0,1)$ independent of $n,p$. For the square loss and in the absence of corruption in the response, the results additionally yield $n^{-1/2}$-consistent estimates of the noise variance and of the generalization error. This generalizes, to arbitrary convex penalty, estimates that were previously known for the Lasso.
- North America > United States > Texas > Menard County (0.04)
- North America > United States > New York (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
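As a reference point for the quantities named in the abstract above, here is a minimal sketch, assuming the Huber loss, of the score function $\psi=\rho'$ (which is 1-Lipschitz, as required) and the observable quantities $\hat\psi=\psi(y-X\hat\beta)$ and $X^\top\hat\psi$; how these, together with the derivatives $(\partial/\partial y)\hat\psi$ and $(\partial/\partial y)X\hat\beta$, combine into the actual error estimate is given in the paper and is not reproduced here.

```python
import numpy as np

def huber_psi(r, delta=1.345):
    """psi = rho' for the Huber loss: identity inside [-delta, delta],
    clipped outside, hence 1-Lipschitz."""
    return np.clip(r, -delta, delta)

def observable_quantities(X, y, beta_hat, delta=1.345):
    """The data-dependent quantities the out-of-sample estimate is built from."""
    psi_hat = huber_psi(y - X @ beta_hat, delta)
    return psi_hat, X.T @ psi_hat
```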
Implicit Gradient Regularization
Barrett, David G. T., Dherin, Benoit
Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization. We find that the discrete steps of gradient descent implicitly regularize models by penalizing gradient descent trajectories that have large loss gradients. We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization. We confirm empirically that implicit gradient regularization biases gradient descent toward flat minima, where test errors are small and solutions are robust to noisy parameter perturbations. Furthermore, we demonstrate that the implicit gradient regularization term can be used as an explicit regularizer, allowing us to control this gradient regularization directly. More broadly, our work indicates that backward error analysis is a useful theoretical approach to the perennial question of how learning rate, model size, and parameter regularization interact to determine the properties of overparameterized models optimized with gradient descent.
- North America > United States > Texas > Menard County (0.04)
- Asia > Middle East > Jordan (0.04)
- Africa > Cameroon > Gulf of Guinea (0.04)
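A minimal sketch of using the gradient-norm penalty as an explicit regularizer, as the abstract above suggests; the penalty coefficient and the double-backpropagation setup are illustrative assumptions, not the authors' exact training recipe.

```python
import torch

def gradient_regularized_loss(model, loss_fn, inputs, targets, coef=1e-2):
    """Training loss plus an explicit penalty on the squared gradient norm,
    mirroring the implicit regularization described above."""
    loss = loss_fn(model(inputs), targets)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    penalty = sum(g.pow(2).sum() for g in grads)
    return loss + coef * penalty
```

Calling `.backward()` on the returned value differentiates through the penalty (double backpropagation), so the extra cost is roughly one additional backward pass per step.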
Dissecting Neural ODEs
Massaroli, Stefano, Poli, Michael, Park, Jinkyoo, Yamashita, Atsushi, Asama, Hajime
Continuous deep learning architectures have recently re-emerged as variants of Neural Ordinary Differential Equations (Neural ODEs). The infinite-depth approach offered by these models theoretically bridges the gap between deep learning and dynamical systems; however, deciphering their inner workings is still an open challenge, and most of their applications are currently limited to their inclusion as generic black-box modules. In this work, we "open the box" and offer a system-theoretic perspective, including state augmentation strategies and robustness, with the aim of clarifying the influence of several design choices on the underlying dynamics. We also introduce novel architectures: among them, a Galerkin-inspired depth-varying parameter model and neural ODEs with data-controlled vector fields.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States > Texas > Menard County (0.04)
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- Research Report (0.50)
- Overview (0.46)
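A minimal sketch of one of the architectural ideas mentioned above, a data-controlled vector field whose dynamics are conditioned on the initial state; the fixed-step Euler integrator and layer sizes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DataControlledODE(nn.Module):
    """Neural ODE block with a data-controlled vector field f(h, h0)."""
    def __init__(self, dim, width=64):
        super().__init__()
        self.field = nn.Sequential(nn.Linear(2 * dim, width), nn.Tanh(),
                                   nn.Linear(width, dim))

    def forward(self, h0, steps=20, T=1.0):
        h, dt = h0, T / steps
        for _ in range(steps):
            # Explicit Euler step; the vector field sees both the current
            # state and the initial condition h0.
            h = h + dt * self.field(torch.cat([h, h0], dim=-1))
        return h
```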
Augmented Neural ODEs
Dupont, Emilien, Doucet, Arnaud, Teh, Yee Whye
We show that Neural Ordinary Differential Equations (ODEs) learn representations that preserve the topology of the input space and prove that this implies the existence of functions Neural ODEs cannot represent. To address these limitations, we introduce Augmented Neural ODEs which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > Texas > Menard County (0.04)
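A minimal sketch of the augmentation idea: append extra zero-valued dimensions to the state before integrating so that trajectories have room to pass around each other, then discard them afterwards; the integrator and sizes are illustrative assumptions rather than the authors' exact setup.

```python
import torch
import torch.nn as nn

class AugmentedODEBlock(nn.Module):
    """Neural ODE block that integrates in an augmented state space."""
    def __init__(self, dim, aug_dim=4, width=64):
        super().__init__()
        self.aug_dim = aug_dim
        self.field = nn.Sequential(nn.Linear(dim + aug_dim, width), nn.Tanh(),
                                   nn.Linear(width, dim + aug_dim))

    def forward(self, x, steps=20, T=1.0):
        # Augment the input with zeros, integrate with explicit Euler,
        # then drop the auxiliary dimensions.
        zeros = torch.zeros(*x.shape[:-1], self.aug_dim, device=x.device)
        z = torch.cat([x, zeros], dim=-1)
        dt = T / steps
        for _ in range(steps):
            z = z + dt * self.field(z)
        return z[..., : x.shape[-1]]
```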
Resolving motion ambiguities
Diamantaras, K. I., Geiger, D.
We address the problem of optical flow reconstruction and, in particular, the problem of resolving ambiguities near edges. These ambiguities occur due to (i) the aperture problem and (ii) the occlusion problem, where pixels on both sides of an intensity edge are assigned the same velocity estimates (and confidence). However, these measurements are correct for just one side of the edge (the non-occluded one). Our approach is to introduce an uncertainty field with respect to the estimates and confidence measures. We note that the confidence measures are large at intensity edges and larger at the convex sides of the edges, i.e., inside corners, than at the concave sides. We resolve the ambiguities through local interactions via coupled Markov random fields (MRFs). The result is the detection of motion for regions of images with large global convexity.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)
- North America > United States > Texas > Menard County (0.04)